Abstract
Introduction: Early and accurate identification of acute leukemia (AL) subtypes, namely acute lymphoblastic leukemia (ALL), acute myeloid leukemia (AML), and acute promyelocytic leukemia (APL), is critical for time-sensitive therapeutic decisions and prognosis. AI-PAL (Acute Leukemia Prediction Algorithm) is a publicly available, machine learning–based model that predicts the AL subtype from routine clinical and laboratory data (Alcazer et al., Lancet Digit Health, 2024). This tool may be especially valuable in resource-limited settings where rapid access to definitive diagnostics is constrained.
Emerging international frameworks for responsible AI in health care, such as the TRAIN-Europe initiative (van Genderen et al., JAMA 2025), emphasize local performance evaluation, bias assessment, and postmarket surveillance as essential components of AI implementation. Under these frameworks, external validation of predictive algorithms in real-world settings is not only scientifically relevant but ethically imperative: institutions should rigorously assess AI tools in their own patient populations before clinical adoption.
Findings from a German group (Sauer et al., Blood 2024) further underscore the risks of unvalidated implementation and the potential for reduced accuracy when such models are applied to different cohorts. We therefore aimed to externally validate the AI-PAL model in a retrospective cohort of patients with AL at a tertiary center in Brazil.
Methods: We retrospectively identified adult patients diagnosed with AML, APL, or ALL at Hospital Israelita Albert Einstein (May 2010 – November 2025). AI-PAL was applied to each patient's first laboratory panel, and its prediction was compared with the WHO 2016 diagnosis. The primary endpoint was overall accuracy (OA). Secondary endpoints included per-class sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and one-vs-rest and multiclass AUCs (Hand–Till, macro, and macro-weighted). Continuous baseline variables were compared between correctly and incorrectly classified cases using Wilcoxon rank-sum tests with Benjamini–Hochberg (BH) correction. Confidence intervals (CIs) for proportions were calculated with the Clopper–Pearson method; CIs for accuracy and AUCs used a stratified nonparametric bootstrap with 2,000 resamples (bias-corrected and accelerated [BCa]).
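For illustration only, the sketch below shows how the overall-accuracy confidence intervals described above could be computed in Python with NumPy and SciPy. It is not the study's analysis code: the variable names are hypothetical, and the stratified (per-subtype) bootstrap is simplified here to a plain BCa bootstrap over the per-patient correctness indicator.

```python
import numpy as np
from scipy import stats

def overall_accuracy_with_cis(y_true, y_pred, n_resamples=2000, seed=42):
    """Overall accuracy with exact (Clopper-Pearson) and BCa bootstrap 95% CIs."""
    correct = (np.asarray(y_true) == np.asarray(y_pred)).astype(float)  # 1 = correct
    oa = correct.mean()

    # Exact (Clopper-Pearson) binomial CI for the proportion of correct predictions
    exact_ci = stats.binomtest(int(correct.sum()), correct.size).proportion_ci(
        confidence_level=0.95, method="exact"
    )

    # BCa bootstrap CI for accuracy; the stratified (per-subtype) resampling
    # described in the abstract is simplified here to plain resampling of all patients
    boot = stats.bootstrap(
        (correct,), np.mean, n_resamples=n_resamples,
        confidence_level=0.95, method="BCa",
        random_state=np.random.default_rng(seed),
    )
    return oa, (exact_ci.low, exact_ci.high), boot.confidence_interval
```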
Results: We included 106 patients (AML = 83; ALL = 17; APL = 6). AI-PAL returned predictions for 105 patients (one ALL case lacked sufficient data). OA was 83.8% (95% CI [BCa], 74.8–89.5%). Most misclassifications were AML cases predicted as ALL (n = 12; 70.6% of errors). In exploratory analyses, correctly classified cases were older (median 67.5 vs. 46.0 years; Hodges–Lehmann shift +13.0, 95% CI 4.0–18.5; p = 0.006), though no differences remained significant after multiple-testing correction (BH-adjusted p = 0.064 for age; all others ≥ 0.35).
Per-class proportions (one-vs-rest; a computational sketch follows this list):
* ALL: sensitivity 82.4% (95% CI 56.6–96.2%), specificity 86.4% (77.4–92.8%), PPV 53.8% (33.4–73.4%), NPV 96.2% (89.3–99.2%).
* AML: sensitivity 82.9% (73.0–90.3%), specificity 87.0% (66.4–97.2%), PPV 95.8% (88.1–99.1%), NPV 58.8% (40.7–75.4%).
* APL: sensitivity 100% (54.1–100%), specificity 98.0% (92.9–99.8%), PPV 75.0% (34.9–96.8%), NPV 100% (96.3–100%).
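As a minimal sketch of how these one-vs-rest proportions and their exact (Clopper–Pearson) intervals can be derived from the reference and predicted labels, the hypothetical helper below collapses the confusion matrix for one subtype against the rest; it is illustrative and not the study's analysis code.

```python
import numpy as np
from scipy.stats import binomtest

def one_vs_rest_metrics(y_true, y_pred, positive_class):
    """Sensitivity, specificity, PPV, and NPV for one subtype vs. the rest,
    each with an exact (Clopper-Pearson) 95% CI."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == positive_class) & (y_pred == positive_class)))
    fn = int(np.sum((y_true == positive_class) & (y_pred != positive_class)))
    fp = int(np.sum((y_true != positive_class) & (y_pred == positive_class)))
    tn = int(np.sum((y_true != positive_class) & (y_pred != positive_class)))

    def prop_ci(k, n):  # point estimate plus exact binomial 95% CI
        ci = binomtest(k, n).proportion_ci(confidence_level=0.95, method="exact")
        return k / n, (ci.low, ci.high)

    return {
        "sensitivity": prop_ci(tp, tp + fn),
        "specificity": prop_ci(tn, tn + fp),
        "PPV": prop_ci(tp, tp + fp),
        "NPV": prop_ci(tn, tn + fn),
    }

# Example (hypothetical arrays): one_vs_rest_metrics(y_true, y_pred, "APL")
```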
Per-class discrimination (one-vs-rest AUC) was high: ALL AUC 0.893 (95% CI [BCa] 0.755–0.962), AML AUC 0.903 (0.780–0.959), APL AUC 1.000 (CI not estimable due to perfect separation/small n). Multiclass discrimination was also high: Hand–Till AUC 0.945 (95% CI [BCa] 0.871–0.980), macro-AUC 0.932 (0.847–0.975), macro-weighted AUC 0.907 (0.800–0.964).
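For readers reproducing these discrimination metrics, the sketch below shows how the one-vs-rest, Hand–Till, macro, and macro-weighted AUCs named above can be obtained with scikit-learn. It assumes hypothetical arrays of reference labels and AI-PAL class probabilities and is not the study's analysis code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

LABELS = ["ALL", "AML", "APL"]

def discrimination_metrics(y_true, y_proba):
    """y_true: array of subtype labels; y_proba: (n_patients, 3) array of
    AI-PAL class probabilities ordered as LABELS (rows summing to 1)."""
    y_true = np.asarray(y_true)
    out = {}
    # One-vs-rest AUC for each class
    for i, lab in enumerate(LABELS):
        out[f"AUC_{lab}_ovr"] = roc_auc_score((y_true == lab).astype(int), y_proba[:, i])
    # Hand-Till multiclass AUC: macro average over all pairwise (one-vs-one) AUCs
    out["AUC_hand_till"] = roc_auc_score(y_true, y_proba, multi_class="ovo",
                                         average="macro", labels=LABELS)
    # Macro and prevalence-weighted one-vs-rest AUCs
    out["AUC_macro"] = roc_auc_score(y_true, y_proba, multi_class="ovr",
                                     average="macro", labels=LABELS)
    out["AUC_macro_weighted"] = roc_auc_score(y_true, y_proba, multi_class="ovr",
                                              average="weighted", labels=LABELS)
    return out
```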
Conclusions: In this independent, single-center Latin American cohort, AI-PAL demonstrated strong discriminative performance and good OA in predicting acute leukemia subtypes from routine laboratory data. Results were comparable to those of the original study; the lower point accuracy likely reflects class imbalance and the low prevalence of APL in this cohort. AI-PAL offers immediate, probability-based triage from standard laboratory tests and may support early clinical decisions, particularly in settings where cytomorphology, flow cytometry, or molecular diagnostics are delayed. Further multicenter validation in Latin America is ongoing and will be presented.
Keywords: artificial intelligence, acute myeloid leukemia, acute lymphoblastic leukemia, diagnostics, external validation